Load and Visualize IBM Debater® Sentiment Composition Lexicons

In this notebook you will load, explore, clean and visualize data from the IBM Debater Sentiment Composition Lexicons dataset. The dataset includes sentiment composition lexicons and sentiment lexicons:

  1. Sentiment composition lexicons containing 2,783 words.
  2. Sentiment lexicons containing 66,058 unigrams and 262,555 bigrams.

The dataset addresses sentiment composition – predicting the sentiment of a phrase from the interaction between its constituents. For example, in the phrases “reduced bureaucracy” and “fresh injury”, both “reduced” and “fresh” are followed by a negative word. However, “reduced” flips the negative polarity, resulting in a positive phrase, while “fresh” propagates the negative polarity to the phrase level, resulting in a negative phrase. Accordingly, “reduced” is part of our “reversers” lexicon, and “fresh” is part of the “propagators” lexicon.
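The reverser/propagator distinction above can be illustrated with a toy sketch. The two one-word sets and the `compose_polarity` helper are hypothetical stand-ins for illustration only; they are not the released lexicons or the method used to build them.

```python
# Toy illustration only: these one-word sets are hypothetical stand-ins,
# not the released "reversers" and "propagators" lexicons.
REVERSERS = {"reduced"}
PROPAGATORS = {"fresh"}


def compose_polarity(modifier: str, head_polarity: int) -> int:
    """Return the phrase polarity (+1 or -1) for a modifier and its head word."""
    if modifier in REVERSERS:
        return -head_polarity   # a reverser flips the head word's polarity
    if modifier in PROPAGATORS:
        return head_polarity    # a propagator carries it to the phrase level
    raise ValueError(f"unknown modifier: {modifier!r}")


print(compose_polarity("reduced", -1))  # "reduced bureaucracy"
print(compose_polarity("fresh", -1))    # "fresh injury"
```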

The dataset can be obtained for free from the IBM Developer Data Asset Exchange.

0. Prerequisites

Before you run this notebook, complete the following steps:

Insert a project token

When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:

# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context

If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:

[Video: ws-project.mov]

Import required modules

Import and configure the required modules.

1. Load Data

1.1 About

The goal of this set of notebooks is to use the IBM Debater Sentiment Composition Lexicons dataset to categorize text on a sentence level, or as a whole, with a range of sentiments. This could be used in an application which collects customer feedback to help determine customer satisfaction.

Let's first explain a few definitions:

1.2 Read Data

The first step is to load the LEXICON_UG.txt and LEXICON_BG.txt datasets and add a sentiment column to each. The new column is derived from the SENTIMENT_SCORE column but holds only 1 or 0, where 1 is positive sentiment and 0 is negative sentiment.
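A minimal sketch of this step with pandas follows. It makes several assumptions: the file is comma-separated with a header row, the text column is named UNIGRAM, and a score above zero counts as positive. The in-memory sample rows are made up so the example runs without the dataset; the `add_binary_sentiment` helper is hypothetical.

```python
import io

import pandas as pd

# Made-up rows standing in for LEXICON_UG.txt; the column layout is an assumption.
SAMPLE_UG = io.StringIO(
    "UNIGRAM,SENTIMENT_SCORE\n"
    "excellent,0.91\n"
    "terrible,-0.87\n"
    "fresh,0.12\n"
)


def add_binary_sentiment(df: pd.DataFrame, score_col: str = "SENTIMENT_SCORE") -> pd.DataFrame:
    """Return a copy with a 'sentiment' column: 1 if the score is > 0, else 0."""
    out = df.copy()
    # Assumption: zero is the cut-off between negative and positive.
    out["sentiment"] = (out[score_col] > 0).astype(int)
    return out


unigrams = add_binary_sentiment(pd.read_csv(SAMPLE_UG))
print(unigrams)
```

With the real files, `SAMPLE_UG` would be replaced by the path or file object for LEXICON_UG.txt, and the same helper could be applied to LEXICON_BG.txt.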

LEXICON_UG.txt:

This is a list of 66,058 unigrams and their predicted sentiment score. Note that in the paper Learning Sentiment Composition from Sentiment Lexicons, for unigrams that have sentiment in the HL lexicon (the publicly-available sentiment lexicon of Hu and Liu (2004)), the original sentiment from the HL lexicon (+1 or -1) was used, and not the predicted score. This step is not reflected in the released lexicon.

LEXICON_BG.txt:

This is a list of 262,555 selected bigrams in the following format:

2. Data Visualization

In this section, we will visualize the unigrams and bigrams datasets.

2.1 Unigrams

First, let's categorize the sentiment of each term.

Next, we check the distribution of the unigram sentiment polarity scores.
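One way to inspect such a distribution without plotting is to bin the scores with NumPy. This is only a sketch: the scores below are made up, and the assumption that scores fall in the range -1 to 1 should be checked against the real SENTIMENT_SCORE column.

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration; not taken from the dataset.
scores = pd.Series([-0.9, -0.4, -0.1, 0.0, 0.05, 0.2, 0.6], name="SENTIMENT_SCORE")

# Assumption: scores lie in [-1, 1]; four equal-width bins over that range.
counts, edges = np.histogram(scores, bins=4, range=(-1.0, 1.0))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:+.1f}, {right:+.1f}): {n}")
```

In a notebook, the same column could instead be drawn directly with `scores.plot.hist(bins=...)`.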

The distribution of sentiment polarity is roughly bell-shaped.

Then, let's check the unigram text length distribution.
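The length check can be sketched as follows, assuming the text column is named UNIGRAM (an assumption) and using made-up terms. The `text_len` column name is hypothetical.

```python
import pandas as pd

# Made-up terms for illustration; the UNIGRAM column name is an assumption.
terms = pd.DataFrame({"UNIGRAM": ["good", "bad", "awful", "fine", "superb"]})

# Character length of each term.
terms["text_len"] = terms["UNIGRAM"].str.len()

# Number of terms per length, ordered by length.
length_counts = terms["text_len"].value_counts().sort_index()
print(length_counts)
```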

The unigram text length is also approximately normally distributed.

Next, we store the first letter of each term.

In the next steps, we group the terms by first letter and sentiment, and then count the number of unigrams in each category.
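The first-letter extraction and grouping could look like the sketch below. The rows are made up, and the `first_letter` and `count` column names are hypothetical; the `sentiment` column is the binary column described in section 1.2.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumptions.
df = pd.DataFrame(
    {
        "UNIGRAM": ["sad", "superb", "calm", "cruel", "sunny"],
        "sentiment": [0, 1, 1, 0, 1],
    }
)

# Store the first character of each term.
df["first_letter"] = df["UNIGRAM"].str[0]

# Count unigrams for each (first letter, sentiment) pair.
letter_counts = (
    df.groupby(["first_letter", "sentiment"])
    .size()
    .reset_index(name="count")
)
print(letter_counts)
```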

Now, we visualize the distribution of the unigram count of all letters.

In the unigrams dataset, terms most often start with s or c, while the least frequent first letters are x, y, and z.

2.2 Bigrams

Similarly, we categorize the sentiment of each bigram as positive or negative.

Then we visualize the sentiment polarity distribution.

We group by the POS-TAGS and sentiment score of each bigram term.
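A sketch of this grouping is shown below. The BIGRAM column name, the tag values such as ADJ-NOUN, and the last two rows are assumptions made for illustration; only the first two phrases come from the dataset description above.

```python
import pandas as pd

# Illustrative rows; tag values and column names are assumptions.
bigrams = pd.DataFrame(
    {
        "BIGRAM": ["reduced bureaucracy", "fresh injury", "great service", "lose hope"],
        "POS-TAGS": ["ADJ-NOUN", "ADJ-NOUN", "ADJ-NOUN", "VERB-NOUN"],
        "sentiment": [1, 0, 1, 0],
    }
)

# One row per tag pattern, one column per sentiment value (0 or 1).
pos_counts = (
    bigrams.groupby(["POS-TAGS", "sentiment"])
    .size()
    .unstack(fill_value=0)
)
print(pos_counts)
```

A table shaped like `pos_counts` can be passed to `pos_counts.plot.bar()` in a notebook to compare positive and negative counts per tag pattern.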

Next, we plot histograms of the POS-TAGS values and show the counts of positive versus negative sentiment.

3. Save the Cleaned Data

Finally, we save the cleaned dataset as a Project asset for later reuse. You should see output similar to the following after saving the file:

{'file_name': 'bigrams.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'ibmdebatersentimentcompositionlex-...',
 'asset_id': '...'}

Note: For this step to work, your project token (see the first cell of this notebook) must have the Editor role. By default, this will overwrite any existing file.
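The save step might be wrapped as in the sketch below. It assumes the `project` object from the prerequisites exposes `save_data(file_name, data, overwrite)` as in `project_lib`; the `save_dataframe` helper and the `_FakeProject` stand-in are hypothetical and exist only so the example runs outside Watson Studio.

```python
import pandas as pd


def save_dataframe(project, df: pd.DataFrame, file_name: str) -> dict:
    """Serialize df to CSV text and store it as a project asset."""
    csv_text = df.to_csv(index=False)
    # Assumption: save_data accepts these keyword arguments.
    return project.save_data(file_name=file_name, data=csv_text, overwrite=True)


class _FakeProject:
    """Hypothetical stand-in for a project_lib Project, for local runs only."""

    def __init__(self):
        self.saved = {}

    def save_data(self, file_name, data, overwrite=False):
        if file_name in self.saved and not overwrite:
            raise RuntimeError(f"{file_name} already exists")
        self.saved[file_name] = data
        return {"file_name": file_name, "message": "File saved to project storage."}


fake = _FakeProject()
result = save_dataframe(fake, pd.DataFrame({"BIGRAM": ["fresh injury"], "sentiment": [0]}), "bigrams.csv")
print(result)
```

In the notebook itself, the real `project` object would be passed in place of `fake`.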

Next steps

Citation

Toledo-Ronen et al., Learning Sentiment Composition from Sentiment Lexicons, COLING 2018

Authors

This notebook was created by the Center for Open-Source Data & AI Technologies.


Copyright © 2021 IBM. This notebook and its source code are released under the terms of the MIT License.